Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines
نویسندگان
چکیده
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending HuSpaCy toolkit with several improvements to its architecture. Compared existing NLP tools Hungarian, all our pipelines feature basic steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological lemmatization, dependency parsing named entity recognition high accuracy throughput. We thoroughly evaluated proposed enhancements, compared demonstrated competitive new preprocessing steps. All experiments are reproducible freely available under permissive license.
منابع مشابه
Multilingual, Efficient and Easy NLP Processing with IXA Pipeline
IXA pipeline is a modular set of Natural Language Processing tools (or pipes) which provide easy access to NLP technology. It aims at lowering the barriers of using NLP technology both for research purposes and for small industrial developers and SMEs by offering robust and efficient linguistic annotation to both researchers and non-NLP experts. IXA pipeline can be used “as is” or exploit its m...
متن کاملAutomatic Evaluation and Composition of NLP Pipelines with Web Services
We describe the innovative use of describing an existing natural language “pipeline” using the Semantic Web, and focus on how the performance and results of the components may be described. Earlier work has shown how NLP Web Services can be automatically composed via Semantic Web Service composition, and once the results of NLP components can be stored directly, they can also be used to direct ...
متن کاملBuilding reliable and efficient data transfer and processing pipelines
Scientific distributed applications have an increasing need to process and move large amounts of data across wide area networks. Existing systems either closely couple computation and data movement, or they require substantial human involvement during the end-to-end process. We propose a framework that enables scientists to build reliable and efficient data transfer and processing pipelines. Ou...
متن کاملJoining Statistics with NLP for Text Categorization
Automatic news categorization systems have produced high accuracy, consistency, and flexibility using some natural language processing techniques. These knowledge-based categorization methods are more powerful and accurate than statistical techniques. However, the phrasal pre-processing and pattern matching methods that seem to work for categorization have the disadvantage of requiring a fair a...
متن کاملImproving Biomedical Text Categorisation with NLP
Background: Text categorisation has been used in bioinformatics to help identify documents containing protein-protein interactions. Standard text categorisation methods have used the bag-of-words approach with little input from NLP. While this has proved effective in the past, there is some evidence that the techniques are not adequate in some biological domains. Here we examine how chunking, n...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Lecture Notes in Computer Science
سال: 2023
ISSN: ['1611-3349', '0302-9743']
DOI: https://doi.org/10.1007/978-3-031-40498-6_6